Prosper is America’s first marketplace peer-to-peer lending platform, with over $12 billion in funded loans. This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit score, etc.
In this project, an Exploratory Data Analysis (EDA) for the Prosper loan data set from 2005-2014 was conducted, in order to dig out some inspiring and valuable insights about the features that can affect the loan status.
- Introduction
- Exploratory Data Analysis 2.1 Univariate Plots Section 2.2 Univariate Analysis 2.3 Bivariate Plots Section 2.4 Bivariate Analysis 2.5 Multivariate Plots Section 2.6 Multivariate Analysis
- Final Plots and Summary
- Reflection
In this analysis, I used 9 packages for data manipulation, data visulization and correlation analysis. The name are below: ggplot2, date, dplyr, gridExtra, scales, choroplethrMaps, chroplethr, GGally, knitr.
The Prosper loan data contains 113,937 loans that were issued through 2005-2014, with 81 variables for each loan. The key feature is the loan status which could help investors estimate the default risk of borrowers. The interesting features that might affect the loan status could from numerical variables, including borrower rate,monthly income,loan amount, credit score, term of the loan, total prospers loans. The fetures could also from and categorical variables, including employment status, loan purpose, home ownership, borrower state.
In this section, I want to start exploring the data by reviewing some basic values and descriptive statistics. The fetures that I am interested to investigate are below:
In risk management one important job is to build a predictive model to predict whether the loan will be default or not. ‘delinquent’, ‘default’ and ‘charge off’ are terms helping investor to estimate whether or not they can expect to collect on the outstanding debt at all. According to the definition of delinquencies from badcredit.org, an account will be considered severely deliquent if no payment has been received 60 days beyond the due date.Those delinquent accounts may have higher risk of default.
So I created a new variable ‘DelinquentBorrowers’ using ‘0’ to represent none deliquent borrowers and ‘1’ to represent deliquent borroweres, including loans that are noted default, charged off, and past due over 60 days. According to LendingClub, In general, a note goes into Default status when it is 121 or more days past due. When a note is in Default status, Charge Off occurs no later than 150 days past due (i.e. No later than 30 days after the Default status is reached) when there is no reasonable expectation of sufficient payment to prevent the charge off.
## # A tibble: 2 x 3
## DelinquentBorrowers counts Percentage
## <dbl> <int> <dbl>
## 1 0 96294 84.5
## 2 1 17643 15.5
The loan categories are give as numerical variables, in order to investigate the correlation of loan purpose with other features, the ‘ListingCategory’ column was coverted into categorical variable.
Some people may have used most of their bank credit because they only have very small amount of available credit, or some people may only have use less than 10 % of their bank credit because their available credit is large. Most people only have one prosper loan. they might borrow money from prosper loan for some extrem situation. We can dig into those in later session.
There are 113937 observations of 81 variables. Data types including ‘factor’, ‘num’, and’int. Variables are about loan information and borrowers’ information
From an investors’s perspective, they would be interested in any features that might predict the risk of default from the borrowers. Those features could be the purpose of the loan, loan amont, loan status, the employment status, the borrowers’ creadit score,the borrowers’s income range, the number ofloans they have, the debt to income ratio, the number of account deliquent.
To study the borrower interest rate and loan status, I included ListingCreationDate, bank credit utilization, AvailableBankcardCredit, StatedMonthlyIncome and DebtToIncomeRatio.The borrower state could be intesting feature to explore because the average income varies from state to state. The homeowner might have higher risk of default, the employment status, employment status duration could also tell if the borrowers are able to pay off the loan ontime.
In order to investigate the loan amount over time. I created a new column named ‘LoanCreationDate’ in the format of ‘%m/%d/%Y’. Also, to investigate the purpose of the loan, I created a new column named ‘NewLisingCat’ to convert the numerical category of loan into categorical variables. Another new column named DelinquentBorrowers is also created to categorize the deliquent and none deliquent loans.
For the debt to income ratio, there is an outlier at ratio 10.01, which does not make sense. Also the column name of ListingCreationDate is weired, so I renamed the column name.
Scatter matrix plot is a fast way to provide us insight of the correlation between multiple paired variables at once. So the ggpairs function was used to get a general idea of the correlations amoung the variables. The scatter matrix shows that the credit grade is clearly one factor affect the interest rate. The correlation coefficient of other variables with the interest rate are very small, therefore, will be excluded from the predictive model.
- The loans borrowed from Prosper are small loans. The main purpose of the loans is for debt consolidation. People s who are employed with shoter employment peoride are more likely to borrow money from Prosper. Borrower with higher monthly income have higher prosper score and lower debt-to-income ratio.
- Borrower with same loan amount have very spread borrower rate. So the borrower rate might be depend on other features, such as the borrower’s income, employment status, loan purpose.
- The strongest relationship I found is the Prosper score vs. monthly income, and Debt-to-income ratio vs. monthly income. Borrower with higher monthly income have higher prosper score and lower debt-to-income ratio.
-In the Bivariable analysis session, I found that Borrower with higher income have higher Prosper score. This is consistent with this plot that delinquent borrowers have lower income. But delinquent borrowers are not always those with lower Prosper score.
- Borrower rate is related to prosper score, higher borrower rate leads to lower Prosper score. Delinquent borrowers might due to higher borrower rate.
- At the begining, I assumed employment status and loan purpose are two very important features that could help predict the default risk, however, there are large proportion of missing values for employment status and loan purpose.
- Deliquent borrowers are small loan borrowers. Also, deliquent borrowers tend to have higher Debt-to-income ratio.
It is interesting to know that Debt Consolidation is the main purpose for loan borrowers. But we can als see from the plot that there is a large portion of loan purpose noted as ‘other’ and ‘not available’. So we can remove these twoo categories to investigate the relation between deliquencies and loan purpose.
Borrowers state covers aross the 52 states. California, Texas, Illinois,Florida, and Georgia have the highest count of delinquents but it should be taken into consideration that those states have high level of population. also the more populous. While Florida, Texas and the East coast follow suite. “middle America” has the much lower levels of delinquent borrowers, but this might be due to the factors such as less populous and lower living cost.
There is strong evidence that higher borrower rate tend to lead higher deliquencies. So as investors, adjusting borower rate might increase the chance to collect the money back.
- The prospser loans dataset contains over 100k observations with 81 variables spanning across 10 years.The first step before conducting any data analysis is to understand the variables, terminology and general domain knowledge of financial peer-to-peer lending.Second, it is very important to determine which variables to analyze and stick to those variables without drifting too far off.Also there are so many missing values and none specific observations, cleaning is needed for looking into the relationship between some variables.
- For loan data, I believe that default risk analysis is a key component to help investors to decide if they could collect the full loan.So any features that could lead to default are worth investigating, In my analysis, I found that borrower rate from investor’s aspect and Prosper score from borrower’s aspect are strongly relatted to deliquencies. This brought me to investigate more features that could possibly affect borrower rate and prosper score, including monthly income, borrower region, loan purpose, employment status, credit card utilization, available credit, loan amount,home ownership, and debt to income ratio.
The Exploratory Data Analysis strategy is a good way to find some insights of the date through interesting visualizations. However, EDA works better for data set with limited variables. For the Prosper Loan data with 81 variables, it is really time consuming to find out the most intesting features and investigate the correlations between them. In the future,an useful method to investigate this data could be building predictive models using machine learning.